Introduction

The majority of data sets collected by researchers in all disciplines are mul- tivariate, meaning that several measurements, observations, or recordings are taken on each of the units in the data set.

Most multivariate data sets can be represented in the same way, namely in a rectangular format known from spreadsheets, in which the elements of each row correspond to the variable values of a particular unit in the data set and the elements of the columns correspond to the values taken by a particular variable.

Type of covariates

Four levels of measurements are often distinguished:

In many statistical textbooks, discussion of different types of measurements is often followed by recommendations as to which statistical techniques are suitable for each type; for example, analyses on nominal data should be limited to summary statistics such as the number of cases, the mode, etc.

And, for ordinal data, means and standard deviations are not suitable. But Velleman and Wilkinson (1993) make the important point that restricting the choice of statistical methods in this way may be a dangerous practise for data analysis–in essence the measurement taxonomy described is often too strict to apply to real-world data.

Missing values

Missing values in multivariate data may arise for a number of reasons; for example, non-response in sample surveys, dropouts in longitudinal data, or refusal to answer particular questions in a questionnaire. The most important approach for dealing with missing data is to try to avoid them during the data-collection stage of a study. But despite all the efforts a researcher may make, he or she may still be faced with a data set that contains a number of missing values.

One option to deal with missing data is to take the complete-case analysis route because this is what most statistical software packages do automatically. Using complete-case analysis on multivariate data means omitting any case with a missing value on any of the variables. It is easy to see that if the number of variables is large, then even a sparse pattern of missing values can result in a substantial number of incomplete cases. One possibility to ease this problem is to simply drop any variables that have many missing values. But complete-case analysis is not recommended for two reasons:

So, at the very least, complete-case analysis leads to a loss, and perhaps a substantial loss, in power by discarding data, but worse, analyses based just on complete cases might lead to misleading conclusions and inferences.

A relatively simple alternative to complete-case analysis that is often used is available-case analysis. This is a straightforward attempt to exploit the incomplete information by using all the cases available to estimate quantities of interest. For example, if the researcher is interested in estimating the correlation matrix of a set of multivariate data, then available-case analysis uses all the cases with variables X_{i} and X_{j} present to estimate the correlation between the two variables.

Both complete-case and available-case analyses are unattractive unless the number of missing values in the data set is “small”.

The scatterplot

The scatterplot is the standard for representing continuous bivariate data but, as we shall see later in this chapter, it can be enhanced in a variety of ways to accommodate information about other variables.

library(curl)
## Using libcurl 8.4.0 with LibreSSL/3.3.6
trees <- read.csv(curl("https://raw.githubusercontent.com/orrortega/statistics-natural-resources/main/Day_II/Data/trees.csv"))
str(trees)
## 'data.frame':    1000 obs. of  5 variables:
##  $ site  : int  4 5 2 5 1 1 2 2 2 1 ...
##  $ dbh   : num  29.7 33.3 28 39.9 47.9 ...
##  $ height: num  36.1 42.3 41.9 46.5 43.9 26.2 29.8 35.6 42.1 36.5 ...
##  $ sex   : chr  "male" "male" "female" "female" ...
##  $ dead  : int  0 0 0 0 0 0 0 0 0 0 ...
dlab <- "Diameter (cm)"
hlab <- "Height (m)"

plot(height ~ dbh, data = trees,  xlab = dlab, ylab = hlab)

The bivariate boxplot

plot(height ~ dbh, data = trees,  xlab = dlab, ylab = hlab)
rug(trees$dbh, side = 1)
rug(trees$height, side = 2)

layout(matrix(c(2, 0, 1, 3), nrow = 2, byrow = TRUE),
        widths = c(2, 1), heights = c(1, 1), respect = TRUE)

xlim <- with(trees, range(dbh)) * 1.1

plot(height ~ dbh, data = trees, cex.lab = 0.9,
      xlab = dlab, ylab = hlab, type = "n", xlim = xlim)

with(trees, text(dbh, height, cex = 0.6,
      labels = abbreviate(row.names(trees))))

with(trees, hist(dbh, main = "", xlim = xlim))

with(trees, boxplot(height))

library(MVA)
## Loading required package: HSAUR2
## Loading required package: tools
par(mfrow=c(1,2))
with(trees, plot(height, dbh,
                            xlab = dlab, ylab = hlab, cex.lab =0.9)) 
with(trees, chiplot(dbh, height))

Bubble plot

fires=read.csv(curl("https://raw.githubusercontent.com/orrortega/statistics-natural-resources/main/data/Fires.csv"))
fires$dominant.spread.direction<-as.factor(fires$dominant.spread.direction)
str(fires)
## 'data.frame':    60 obs. of  13 variables:
##  $ Region                   : chr  "Australia" "Australia" "Australia" "Australia" ...
##  $ ID                       : num  986535 880609 979997 887061 883061 ...
##  $ lat                      : num  -19.6 -16.8 -17.9 -24.4 -19.7 ...
##  $ lon                      : num  134 133 132 136 132 ...
##  $ year                     : num  2007 2014 2004 2011 2011 ...
##  $ start.DOY                : num  232 251 276 251 232 249 244 244 232 263 ...
##  $ end.DOY                  : num  303 322 323 281 278 275 311 287 273 319 ...
##  $ Size..km2.               : num  40026 37289 33228 23869 22561 ...
##  $ duration..days.          : num  72 72 48 31 47 27 68 44 42 57 ...
##  $ expansion..km2.day.1.    : num  556 518 692 770 480 ...
##  $ fire.line.length..km.    : num  298 327 360 389 294 ...
##  $ speed..km.day.1.         : num  18.8 13.8 14.6 15.3 12.8 ...
##  $ dominant.spread.direction: Factor w/ 8 levels "east","north",..: 4 6 6 4 4 4 6 6 4 4 ...
ylim <- with(fires, range(expansion..km2.day.1.)) * c(0.95, 1)

plot(expansion..km2.day.1. ~ speed..km.day.1., data = fires,
      xlab = "Average speed (km/day)",
      ylab = "Average expansion (km2/day)", pch = 10,
ylim = ylim)
with(fires, symbols(speed..km.day.1., expansion..km2.day.1., circles = Size..km2., inches = 0.5, add = TRUE))

require(grDevices)
library(palmerpenguins)

Star Plot

library(tidyr)
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:HSAUR2':
## 
##     household
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.4
## ✔ ggplot2   3.5.1     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()     masks stats::filter()
## ✖ dplyr::lag()        masks stats::lag()
## ✖ readr::parse_date() masks curl::parse_date()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggiraphExtra)
cbp1 <- c("#999999", "#E69F00", "#56B4E9", "#009E73",
          "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
penguins %>%
    remove_missing() %>%
    select(-island, -year) %>%
    ggRadar(aes(x = c(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g), 
                group = species,
                colour = sex, facet = sex), 
            rescale = TRUE, 
            size = 1, interactive = FALSE, 
            use.label = TRUE) +
     scale_color_manual(values = cbp1) +
  scale_fill_manual(values = cbp1) +
  theme_bw() +
     scale_y_discrete(breaks = NULL) + # don't show ticks
      labs(
          title = "Radar/spider/star chart", 
          subtitle = "Body mass of male & female penguins per species",
          caption = "Source: https://github.com/allisonhorst/palmerpenguins")
## Warning: Removed 11 rows containing missing values or values outside the scale
## range.

## The scatterplot matrix

mat <- penguins %>%
  remove_missing() %>%
  select(bill_depth_mm, bill_length_mm, body_mass_g, flipper_length_mm)
## Warning: Removed 11 rows containing missing values or values outside the scale
## range.
# Correlation panel
panel.cor <- function(x, y){
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(0, 1, 0, 1))
    r <- round(cor(x, y), digits=2)
    txt <- paste0("R = ", r)
    cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt, cex = cex.cor * r)
}
# Customize upper panel
upper.panel<-function(x, y){
  points(x,y, pch = 19, col = cbp1[penguins$species])
}
# Create the plots
pairs(mat, 
      lower.panel = panel.cor,
      upper.panel = upper.panel)
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:purrr':
## 
##     some
## The following object is masked from 'package:dplyr':
## 
##     recode
peng <- penguins %>%
    rename(
          bill_depth = bill_depth_mm,
          bill_length = bill_length_mm,
          flipper_length = flipper_length_mm, 
          body_mass = body_mass_g
         ) %>%
  mutate(species = as.factor(species),
         island = as.factor(island),
         sex = as.factor(substr(sex,1,1)))

scatterplotMatrix(~ bill_depth + bill_length  + flipper_length + body_mass | species,
                  data=peng,
                  ellipse=list(levels=0.68),
                  col = scales::hue_pal()(3),
                  legend=list(coords="bottomright"))

library(GGally)
ggpairs(peng, mapping = aes(color = species), 
        columns = c("bill_length",    "bill_depth",    
                    "flipper_length", "body_mass",
                    "island", "sex"))

library("scatterplot3d")
with(penguins, {scatterplot3d(x = mat$bill_depth_mm, y = mat$bill_length_mm, z = mat$body_mass_g, main="3-D Scatterplot Height vs Diameter")
})

library(lattice)
plot(xyplot(body_mass_g ~ bill_depth_mm| cut(bill_length_mm, 2), data = penguins))

library(lattice)
plot(xyplot(body_mass_g ~ bill_depth_mm| cut(bill_length_mm, 4), data = penguins))

Three dimensional plots

flipper_length<- with(penguins, equal.count(flipper_length_mm,4))
plot(cloud(body_mass_g ~ bill_depth_mm * bill_length_mm | flipper_length, panel.aspect = 0.9,  data = penguins))